What is Pose Transfer? Pose transfer refers to the image generation of a person with a previously unseen novel pose from another image of a person displaying that pose. The pose of the source is applied to the target.
Papers and Code
Jan 21, 2025
Abstract:Audio deepfakes pose significant threats, including impersonation, fraud, and reputation damage. To address these risks, audio deepfake detection (ADD) techniques have been developed, demonstrating success on benchmarks like ASVspoof2019. However, their resilience against transferable adversarial attacks remains largely unexplored. In this paper, we introduce a transferable GAN-based adversarial attack framework to evaluate the effectiveness of state-of-the-art (SOTA) ADD systems. By leveraging an ensemble of surrogate ADD models and a discriminator, the proposed approach generates transferable adversarial attacks that better reflect real-world scenarios. Unlike previous methods, the proposed framework incorporates a self-supervised audio model to ensure transcription and perceptual integrity, resulting in high-quality adversarial attacks. Experimental results on benchmark dataset reveal that SOTA ADD systems exhibit significant vulnerabilities, with accuracies dropping from 98% to 26%, 92% to 54%, and 94% to 84% in white-box, gray-box, and black-box scenarios, respectively. When tested in other data sets, performance drops of 91% to 46%, and 94% to 67% were observed against the In-the-Wild and WaveFake data sets, respectively. These results highlight the significant vulnerabilities of existing ADD systems and emphasize the need to enhance their robustness against advanced adversarial threats to ensure security and reliability.
* WACV 2025
Via
Jan 20, 2025
Abstract:Transfer learning, successful in knowledge translation across related tasks, faces a substantial privacy threat from membership inference attacks (MIAs). These attacks, despite posing significant risk to ML model's training data, remain limited-explored in transfer learning. The interaction between teacher and student models in transfer learning has not been thoroughly explored in MIAs, potentially resulting in an under-examined aspect of privacy vulnerabilities within transfer learning. In this paper, we propose a new MIA vector against transfer learning, to determine whether a specific data point was used to train the teacher model while only accessing the student model in a white-box setting. Our method delves into the intricate relationship between teacher and student models, analyzing the discrepancies in hidden layer representations between the student model and its shadow counterpart. These identified differences are then adeptly utilized to refine the shadow model's training process and to inform membership inference decisions effectively. Our method, evaluated across four datasets in diverse transfer learning tasks, reveals that even when an attacker only has access to the student model, the teacher model's training data remains susceptible to MIAs. We believe our work unveils the unexplored risk of membership inference in transfer learning.
Via
Jan 20, 2025
Abstract:Research has increasingly explored the application of artificial intelligence (AI) and machine learning (ML) within the mental health domain to enhance both patient care and healthcare provider efficiency. Given that mental health challenges frequently emerge during early adolescence -- the critical years of high school and college -- investigating AI/ML-driven mental health solutions within the education domain is of paramount importance. Nevertheless, conventional AI/ML techniques follow a centralized model training architecture, which poses privacy risks due to the need for transferring students' sensitive data from institutions, universities, and clinics to central servers. Federated learning (FL) has emerged as a solution to address these risks by enabling distributed model training while maintaining data privacy. Despite its potential, research on applying FL to analyze students' mental health remains limited. In this paper, we aim to address this limitation by proposing a roadmap for integrating FL into mental health data analysis within educational settings. We begin by providing an overview of mental health issues among students and reviewing existing studies where ML has been applied to address these challenges. Next, we examine broader applications of FL in the mental health domain to emphasize the lack of focus on educational contexts. Finally, we propose promising research directions focused on using FL to address mental health issues in the education sector, which entails discussing the synergies between the proposed directions with broader human-centered domains. By categorizing the proposed research directions into short- and long-term strategies and highlighting the unique challenges at each stage, we aim to encourage the development of privacy-conscious AI/ML-driven mental health solutions.
* 18 pages, 1 figure, 4 tables
Via
Jan 20, 2025
Abstract:We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, that generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key shortcomings causing the loss of dynamic details, enhancing the lifelike qualities of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules in synthesizing fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations. The code is available at https://github.com/bytedance/X-Dyna.
Via
Jan 14, 2025
Abstract:Pose distillation is widely adopted to reduce model size in human pose estimation. However, existing methods primarily emphasize the transfer of teacher knowledge while often neglecting the performance degradation resulted from the curse of capacity gap between teacher and student. To address this issue, we propose AgentPose, a novel pose distillation method that integrates a feature agent to model the distribution of teacher features and progressively aligns the distribution of student features with that of the teacher feature, effectively overcoming the capacity gap and enhancing the ability of knowledge transfer. Our comprehensive experiments conducted on the COCO dataset substantiate the effectiveness of our method in knowledge transfer, particularly in scenarios with a high capacity gap.
* 5 pages, 1 figures
Via
Jan 15, 2025
Abstract:Face swapping transfers the identity of a source face to a target face while retaining the attributes like expression, pose, hair, and background of the target face. Advanced face swapping methods have achieved attractive results. However, these methods often inadvertently transfer identity information from the target face, compromising expression-related details and accurate identity. We propose a novel method DynamicFace that leverages the power of diffusion model and plug-and-play temporal layers for video face swapping. First, we introduce four fine-grained face conditions using 3D facial priors. All conditions are designed to be disentangled from each other for precise and unique control. Then, we adopt Face Former and ReferenceNet for high-level and detailed identity injection. Through experiments on the FF++ dataset, we demonstrate that our method achieves state-of-the-art results in face swapping, showcasing superior image quality, identity preservation, and expression accuracy. Besides, our method could be easily transferred to video domain with temporal attention layer. Our code and results will be available on the project page: https://dynamic-face.github.io/
Via
Jan 15, 2025
Abstract:In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.
* Accepted to TPAMI, an extended version of a paper published in
ECCV2020. arXiv admin note: substantial text overlap with arXiv:2007.09278
Via
Jan 13, 2025
Abstract:Predicting the impact of single-point amino acid mutations on protein stability is essential for understanding disease mechanisms and advancing drug development. Protein stability, quantified by changes in Gibbs free energy ($\Delta\Delta G$), is influenced by these mutations. However, the scarcity of data and the complexity of model interpretation pose challenges in accurately predicting stability changes. This study proposes the application of deep neural networks, leveraging transfer learning and fusing complementary information from different models, to create a feature-rich representation of the protein stability landscape. We developed four models, with our third model, ThermoMPNN+, demonstrating the best performance in predicting $\Delta\Delta G$ values. This approach, which integrates diverse feature sets and embeddings through latent transfusion techniques, aims to refine $\Delta\Delta G$ predictions and contribute to a deeper understanding of protein dynamics, potentially leading to advancements in disease research and drug discovery.
Via
Jan 15, 2025
Abstract:Seasickness is a prevalent issue that adversely impacts both passenger experiences and the operational efficiency of maritime crews. While techniques that redirect attention have proven effective in alleviating motion sickness symptoms in terrestrial environments, applying similar strategies to manage seasickness poses unique challenges due to the prolonged and intense motion environment associated with maritime travel. In this study, we propose a mindfulness brain-computer interface (BCI), specifically designed to redirect attention with the aim of mitigating seasickness symptoms in real-world settings. Our system utilizes a single-channel headband to capture prefrontal EEG signals, which are then wirelessly transmitted to computing devices for the assessment of mindfulness states. The results are transferred into real-time feedback as mindfulness scores and audiovisual stimuli, facilitating a shift in attentional focus from physiological discomfort to mindfulness practices. A total of 43 individuals participated in a real-world maritime experiment consisted of three sessions: a real-feedback mindfulness session, a resting session, and a pseudofeedback mindfulness session. Notably, 81.39% of participants reported that the mindfulness BCI intervention was effective, and there was a significant reduction in the severity of seasickness, as measured by the Misery Scale (MISC). Furthermore, EEG analysis revealed a decrease in the theta/beta ratio, corresponding with the alleviation of seasickness symptoms. A decrease in overall EEG band power during the real-feedback mindfulness session suggests that the mindfulness BCI fosters a more tranquil and downregulated state of brain activity. Together, this study presents a novel nonpharmacological, portable, and effective approach for seasickness intervention, with the potential to enhance the cruising experience for both passengers and crews.
Via
Jan 16, 2025
Abstract:Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and the large model, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. (Code is available at: https://github.com/wqyin/SMPLest-X).
* An extension of SMPLer-X [arXiv:2309.17448]. Homepage:
https://caizhongang.com/projects/SMPLer-X/
Via